Inside Indy 1993

Inside Indy 1993 / Inside Indy 1993.iso / demos / TRAIN / spanel.doc

Text File | 1993-06-23 | 10.9 KB | 230 lines

spanel(1)

Name
    spanel - control panel for speech recognition

Synopsis
    spanel [ -display displayName ] [ -application displayName ] [ -vocabulary vocabularyFileName ] [ -sound soundFileName ] [ -recThr # ] [ -program progName ] [ -background ] [ -output ] [ -topLevel ] [ -noAGC ] [ -noSave ]

Description
    spanel comprises the components of a speech recognition system. It has a graphical interface that allows the user to manage a set of known speech templates (a vocabulary). Templates can be added, deleted, modified, trained, associated with an action, and saved or loaded as a set in a vocabulary file. spanel contains algorithms that frame isolated speech utterances (or sounds) and train or match these tokens against known templates in the current vocabulary.

    spanel can synthesize keystroke and mouse button events in response to a recognition. These synthesized events are sent to the window with pointer focus via the X protocol.

Options
    -display displayName
        designates the X server screen on which to display spanel's GUI.
    -applicationDisplay displayName
        designates the X server screen to which spanel will send an action event in response to a recognition.
    -vocabulary vocabFileName
        specifies a file (in spanel's vocabulary file format, probably originally created by spanel) that contains the templates to manipulate or match. The default filename suffix for vocabulary files is ".voc".
    -sound soundFileName
        the name of an aifc file to play in response to a recognition. By default, soundFileName is '/usr/lib/sounds/speechAck.aiff'. To disable this possibly annoying feature, specify '... -sound "" ...'.
    -recThr #
        specifies the maximum difference between an unknown token and the best-matched template (the template with the lowest score) that will still qualify as a recognition. The default is 400.
    -program programName
        spanel starts the specified program (forks and execs a child process). spanel terminates when the child exits.
    -background
        starts the spanel process without displaying the GUI. Automatically starts spanel in action mode.
    -output
        spanel writes the action string associated with a template (or the template name if no action is specified) to stdout upon successful recognition.
    -topLevel
        tells spanel to send action events to the top-level window (child of root) containing the window with pointer focus, rather than directly to the window with pointer focus.
    -noAGC
        tells spanel not to use the experimental automatic gain control (AGC) algorithms. These algorithms attempt to find an optimum audio input level for the given signal and noise in the environment. Usually, the user gets better results by adjusting these levels manually, experimenting with apanel's meters, than by relying on AGC.
    -noSave
        tells spanel not to save templates to a central repository (currently lance.esd.sgi.com). This repository will enable users to quickly build custom, robust, speaker-independent vocabularies drawing on a database of previously trained words.

GUI
    Through the menu bar the user may bring up a file submenu or a utilities submenu. The file submenu has options for loading, saving, creating a new vocabulary, and quitting. The utilities submenu has options for bringing up apanel and clearing previous template training.

    The top window is a scrollable status window that lists a few lines of textual comments such as "saved vocabulary test.voc" and "added template zulu". Below the status window is a one-line prompt window indicating what spanel expects the user to do next (for example, the training prompt "say the word 'zero'"). On the right below the prompt window is a multiline scrollable vocabulary window listing all the templates in the currently loaded vocabulary. The templates can be selected for various operations.

    On the left below the prompt window are three check boxes labeled action, score, and train. Selecting any one of them causes spanel to listen.
    When the action mode is selected, spanel will send events to the pointer window upon successful recognition. When the score mode is selected, spanel will display the score (distance from the framed speech token) of each template in the vocabulary window and sort the templates by score (best/lowest first). When the train mode is selected, spanel uses the prompt window to request the utterance of a vocabulary word, then trains the corresponding template with the next frame of speech from the user. Technically these modes are not mutually exclusive, but in practice they are seldom used concurrently.

    The "correct" button is below the three check boxes and allows the user to perform training beyond what can be done in train mode. The "correct" button is active only in score mode and should be pushed when spanel makes an incorrect recognition and the user has selected the correct word from the vocabulary window. Correcting spanel trains the correct (selected) word with the already spoken utterance and untrains the incorrect word (the one spanel scored best). This reduces the chance of spanel making the same mistake in the future. Note that correcting reports an untrain error (informational only, not harmful) if the word has not been trained with enough passes (30) for untraining to have a beneficial effect.

    The delete button deletes the selected templates. The button below the delete button is used for modifying or adding a template name (modifying if a template is selected, adding if not). Pushing this button modifies or adds a template using the text in the field adjacent to this button on the right. Pressing the Return key while the corresponding text field is active has the same effect as pushing the button. As a convention, template names should use lower case, with underscores between words.

    The button below the template button is used for specifying actions.
    Pushing this button (or pressing the Return key while in the corresponding text field to the right) associates the action with the selected template. Actions are represented by text strings and are converted into X events. Each character of plain text is converted into at least one XKeyPress/XKeyRelease event pair (depending on whether or not the X event's KeyCode for the specified KeySym needs a shifted KeyCode). Symbolic names are enclosed in angle brackets (like <Escape>) and follow the X conventions found in /usr/include/X11/keysymdef.h. For instance, the Control-D key commonly used to send EOF to a shell is represented as an action with the string "<Control>d". "<Return>" will also be used often at the end of action strings. Modifier keys, such as the "<Control>" in "<Control>d", are released after one plain (non-symbolic) character (the "d"). The backslash character can escape special characters such as the opening angle bracket '<' and the backslash itself '\'.

    spanel's graphical components resize to some extent with the window.

Capabilities
    The accuracy of spanel will vary greatly (from near-perfect to unusable) depending on the audio input. The biggest factor is the microphone type and its placement. The Indigo microphone can work but has the disadvantages of picking up noise from all directions and of not having a fixed location. A uni-directional noise-cancelling headset microphone should be used to attain the highest accuracy. However, the Indigo microphone can be held about four inches off the side of the mouth (to avoid wind noise) for an acceptable signal-to-noise ratio, or, if the user is blessed with an unusually quiet work environment, it can be positioned on the desktop. The user should experiment to get the best results. Generally, positioning the mic closer to the mouth will provide higher signal levels, while the keyboard and monitor may cause interference.
    The user should observe signal and noise levels using apanel's meters to estimate the performance of various microphone positions and to set the gain. An experimental automatic gain control is implemented for situations where the signal and noise levels are not known ahead of time and cannot be adjusted manually. Most users can adjust these levels better themselves and should do so by invoking spanel with -noAGC and raising the sliders on apanel until spanel starts responding to normal environmental noise (usually just shy of full gain). If spanel reports errors that indicate overflowing audio buffers (a maximum of four seconds of speech is allowed per utterance), the audio level is set too high and the algorithms are being incorrectly triggered by noise.

    The algorithms are currently capable of framing isolated speech. This means each utterance or sound must be preceded and followed by some amount of silence (a few tenths of a second). Although the algorithms are speaker independent, the vocabulary is speaker dependent until it has been trained with samples from various speakers in the target audience. The algorithms will start working after four training passes and will continue to become more robust with hundreds of speakers making several training passes each.

Limitations
    Training with spanel is not an optimal way to develop a vocabulary. An application-oriented training scenario would more accurately capture words as the speaker really says them in the course of using the application. This will require a training API or toolkit interface to the algorithms and cooperation from the application engineers.

Bugs
    Changing the input sampling rate to anything other than 8 kHz can lead to unpredictable results.
    The input source must be set correctly (usually to the microphone) for spanel to operate correctly.
    Currently only one template can be selected at a time.
    spanel always keeps one audio port open for input.
    When not running in background mode, spanel takes far too much CPU time due to improper handling of both audio and X input.
    After deleting templates, spanel gets the UI order somewhat mixed up. To prevent spanel from operating on the wrong template after a template selection in the UI, the vocabulary should be reloaded. This is a bad bug and will be fixed ASAP.

Requirements
    Currently spanel works only on systems with support for the audio library: the 4D30, the 4D35 and the Iris Indigo. spanel uses one audio port for input and possibly one other if output sound acknowledgment is enabled. While listening in background mode, spanel requires approximately 5% of the CPU power, with a short burst when a framed token is matched against the templates in the currently loaded vocabulary.

Also
    Related: see apanel(1). In the future, an API with documentation will be available for programmatic interfacing to the recognition algorithms.
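
Example
    The action-string conventions described under GUI (plain characters, symbolic names in angle brackets, backslash escapes, and modifiers released after one plain character) can be illustrated with a short Python sketch. This is not spanel's implementation, which is not shown in this document. The function names tokenize and to_events, the tuple-based event representation, and the MODIFIERS set are all assumptions made for illustration; the man page itself only confirms "<Control>" as a modifier. Shifted KeyCodes are ignored, and releasing any modifier still held at the end of the string is likewise an assumption.

```python
# Hypothetical set of symbolic names treated as modifiers. Only "Control"
# is confirmed by the man page; the others are assumed for illustration.
MODIFIERS = {"Control", "Shift", "Alt", "Meta"}


def tokenize(action):
    """Split an action string into ("char", c) and ("sym", name) tokens."""
    tokens = []
    i = 0
    while i < len(action):
        c = action[i]
        if c == "\\" and i + 1 < len(action):
            # A backslash turns the next character into plain text,
            # which covers both "\<" and "\\".
            tokens.append(("char", action[i + 1]))
            i += 2
        elif c == "<":
            # Symbolic name: everything up to the closing '>'.
            end = action.find(">", i + 1)
            if end == -1:
                raise ValueError("unterminated symbolic name in %r" % action)
            tokens.append(("sym", action[i + 1:end]))
            i = end + 1
        else:
            # Plain character (a trailing lone backslash also lands here).
            tokens.append(("char", c))
            i += 1
    return tokens


def to_events(action):
    """Expand an action string into a list of (event, key) pairs."""
    events = []
    held = []  # modifiers pressed but not yet released
    for kind, value in tokenize(action):
        if kind == "sym" and value in MODIFIERS:
            # Press the modifier and keep it down for now.
            events.append(("press", value))
            held.append(value)
            continue
        events.append(("press", value))
        events.append(("release", value))
        if kind == "char":
            # Modifiers are released after one plain character.
            while held:
                events.append(("release", held.pop()))
    # Assumption: release any modifier still held when the string ends.
    while held:
        events.append(("release", held.pop()))
    return events
```

    With these definitions, to_events("<Control>d") yields a press of Control, a press and release of d, and then a release of Control, while tokenize("a\\<b") (the five-character string a\<b written as a Python literal) yields three plain characters: a, <, and b.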